compute engine
Resource Utilization Optimized Federated Learning
Zhang, Zihan, Wong, Leon, Varghese, Blesson
Zihan Zhang, University of St Andrews, UK; Leon Wong, Rakuten Mobile, Inc., Japan; Blesson Varghese, University of St Andrews, UK

Abstract -- Federated learning (FL) systems facilitate distributed machine learning across a server and multiple devices. However, FL systems have low resource utilization, limiting their practical use in the real world. This inefficiency primarily arises from two types of idle time: (i) task dependency between the server and devices, and (ii) stragglers among heterogeneous devices. This paper introduces FedOptima, a resource-optimized FL system designed to simultaneously minimize both types of idle time; existing systems do not eliminate or reduce both at the same time. First, devices operate independently of each other using asynchronous aggregation to eliminate straggler effects, and independently of the server by utilizing auxiliary networks to minimize idle time caused by task dependency. Second, the server performs centralized training using a task scheduler that ensures balanced contributions from all devices, improving model accuracy. Four state-of-the-art offloading-based and asynchronous FL methods are chosen as baselines. Experimental results show that, compared to the best results of the baselines on convolutional neural networks and transformers on multiple lab-based testbeds, FedOptima (i) achieves higher or comparable accuracy, (ii) accelerates training by 1.9x to 21.8x, (iii) reduces server and device idle time by up to 93.9% and 81.8%, respectively, and (iv) increases throughput by 1.1x to 2.0x.

Index Terms -- federated learning, distributed system, resource utilization, idle time, edge computing

I. INTRODUCTION

Federated learning (FL) [1]-[3] offers distributed training across user devices as an alternative to traditional centralized machine learning. Devices train a deep neural network (DNN) on their data and send model parameters to the server. The server aggregates these into a global model, which is then distributed to the devices for the next round. Thus, FL utilizes insight from user data via local models to train a global model without sharing the original data.

Sub-optimal resource utilization is a critical problem in FL that results in two types of idle time on the server and devices (see Section II-A). The first is due to task dependency between the server and devices: the server is idle for considerable periods when aggregating local models from devices, as it waits for on-device training to complete, which is usually time-consuming. The second is due to hardware heterogeneity: stragglers, or slower devices, require more time to train than faster devices, which idle while waiting for the stragglers. Two categories of methods are considered in the existing literature for reducing idle time.
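For intuition only, the sketch below shows a staleness-weighted asynchronous aggregation loop of the kind the abstract alludes to: the server merges each device's update as soon as it arrives instead of waiting for a full round. The AsyncAggregator class, the staleness_weight rule, and the update schedule are illustrative assumptions, not FedOptima's actual algorithm.

```python
import copy

def staleness_weight(staleness, alpha=0.6):
    # Illustrative assumption: down-weight updates computed against older
    # versions of the global model so stale devices cannot dominate.
    return alpha / (1.0 + staleness) ** 0.5

class AsyncAggregator:
    """Server-side asynchronous aggregator (illustrative sketch only)."""

    def __init__(self, global_params):
        self.global_params = global_params  # dict: parameter name -> array
        self.version = 0                    # global model version counter

    def on_device_update(self, device_params, device_version):
        # Merge one device's local model the moment it arrives; no device
        # ever waits for a straggler, which removes that source of idle time.
        staleness = self.version - device_version
        w = staleness_weight(staleness)
        for name, value in device_params.items():
            self.global_params[name] = (1 - w) * self.global_params[name] + w * value
        self.version += 1
        # The refreshed global model and its version go straight back to the
        # device, which resumes local training immediately.
        return copy.deepcopy(self.global_params), self.version
```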
Late Breaking Results: Energy-Efficient Printed Machine Learning Classifiers with Sequential SVMs
Besias, Spyridon, Sertaridis, Ilias, Afentaki, Florentia, Balaskas, Konstantinos, Zervakis, Georgios
Printed Electronics (PE) provide a mechanically flexible and cost-effective solution for machine learning (ML) circuits compared to silicon-based technologies. However, due to large feature sizes, printed classifiers are limited by high power, area, and energy overheads, which restrict the realization of battery-powered systems. In this work, we design sequential printed bespoke Support Vector Machine (SVM) circuits that adhere to the power constraints of existing printed batteries while minimizing energy consumption, thereby boosting battery life. Our results show 6.5x energy savings while maintaining higher accuracy compared to the state of the art.
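As a rough illustration of why a sequential datapath trades latency for power, the sketch below evaluates a linear SVM decision function with one multiply-accumulate per step instead of a fully parallel dot product; the function name, loop structure, and numeric format are assumptions for illustration, not the authors' circuit.

```python
def sequential_svm_decision(x, weights, bias):
    """Evaluate sign(w . x + b) one multiply-accumulate at a time.

    A fully parallel printed circuit would need one multiplier per feature;
    a sequential design reuses a single MAC unit across cycles, cutting power
    and area at the cost of latency (illustrative sketch only).
    """
    acc = bias
    for w, xi in zip(weights, x):  # one MAC operation per "cycle"
        acc += w * xi
    return 1 if acc >= 0 else -1
```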
- Energy > Energy Storage (0.56)
- Electrical Industrial Apparatus (0.56)
RescueSNN: Enabling Reliable Executions on Spiking Neural Network Accelerators under Permanent Faults
Putra, Rachmad Vidya Wicaksana, Hanif, Muhammad Abdullah, Shafique, Muhammad
To maximize the performance and energy efficiency of Spiking Neural Network (SNN) processing on resource-constrained embedded systems, specialized hardware accelerators/chips are employed. However, these SNN chips may suffer from permanent faults that affect the functionality of the weight memory and neuron behavior, potentially causing significant accuracy degradation and system malfunction. Such permanent faults may come from manufacturing defects during the fabrication process and/or from device/transistor damage (e.g., due to wear-out) during run-time operation. However, the impact of permanent faults on SNN chips and the respective mitigation techniques have not been thoroughly investigated yet. Toward this, we propose RescueSNN, a novel methodology to mitigate permanent faults in the compute engine of SNN chips without requiring additional retraining, thereby significantly cutting down design time and retraining costs while maintaining throughput and quality. The key ideas of our RescueSNN methodology are (1) analyzing the characteristics of SNNs under permanent faults; (2) leveraging this analysis to improve SNN fault tolerance through effective fault-aware mapping (FAM); and (3) devising lightweight hardware enhancements to support FAM. Our FAM technique leverages the fault map of the SNN compute engine to (i) minimize weight corruption when mapping weight bits onto faulty memory cells, and (ii) selectively employ faulty neurons that do not cause significant accuracy degradation, maintaining accuracy and throughput while considering the SNN operations and processing dataflow. The experimental results show that our RescueSNN improves accuracy by up to 80% while keeping the throughput reduction below 25% at a high fault rate (e.g., 0.5 of the potential fault locations), as compared to running SNNs on the faulty chip without mitigation. In this manner, embedded systems that employ RescueSNN-enhanced chips can efficiently ensure reliable execution against permanent faults during their operational lifetime.
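One simplified way to picture the fault-aware mapping idea is that, given a fault map of the weight memory, the most significant bits of each weight are steered into fault-free cells so that any corruption is confined to low-order bits. The heuristic below is an illustrative sketch under assumed data structures, not RescueSNN's actual FAM procedure.

```python
def map_weight_to_word(weight_bits, faulty_bit_positions):
    """Place a weight's bits into a memory word so that the most significant
    bits land in fault-free cells (illustrative heuristic, not the paper's FAM).

    weight_bits: list of bits, index 0 = MSB.
    faulty_bit_positions: set of faulty cell indices within the word.
    """
    word_size = len(weight_bits)
    healthy = [i for i in range(word_size) if i not in faulty_bit_positions]
    faulty = [i for i in range(word_size) if i in faulty_bit_positions]
    # MSBs are assigned to healthy cells first; the remaining (least
    # significant) bits absorb the faulty cells, so any corruption has the
    # smallest possible numeric effect on the stored weight.
    placement = {}
    for bit_index, cell in zip(range(word_size), healthy + faulty):
        placement[cell] = weight_bits[bit_index]
    return placement  # cell index -> bit value to write
```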
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- North America > United States > New York (0.04)
- Education (0.69)
- Information Technology (0.68)
- Semiconductors & Electronics (0.46)
SambaNova Doubles Up Chips To Chase AI Foundation Models
One of the first tenets of machine learning, which is a very precise kind of data analytics and statistical analysis, is that more data beats a better algorithm every time. A consensus is emerging in the AI community that a large foundation model with hundreds of billions to trillions of parameters is going to beat a highly tuned model on a small subset of relevant data every time. If this turns out to be true, it will have significant implications for AI system architecture, as well as for who will likely be able to afford to run such ginormous foundation models in production. Our paraphrasing of "more data beats a better algorithm" is a riff on a quote from Peter Norvig, an education fellow at Stanford University and a researcher and engineering director at Google for more than two decades, who co-authored the seminal paper The Unreasonable Effectiveness of Data back in 2009, long before machine learning went mainstream but when big data was amassing, changing the nature of data analytics and giving great power to the hyperscalers who gathered it as part of the services they offered customers. "But invariably, simple models and a lot of data trump more elaborate models based on less data," Norvig wrote, and since that time he has been quoted saying something else: "More data beats clever algorithms, but better data beats more data."
SoftSNN: Low-Cost Fault Tolerance for Spiking Neural Network Accelerators under Soft Errors
Putra, Rachmad Vidya Wicaksana, Hanif, Muhammad Abdullah, Shafique, Muhammad
Specialized hardware accelerators have been designed and employed to maximize the performance efficiency of Spiking Neural Networks (SNNs). However, such accelerators are vulnerable to transient faults (i.e., soft errors), which occur due to high-energy particle strikes and manifest as bit flips at the hardware layer. These errors can change the weight values and neuron operations in the compute engine of SNN accelerators, leading to incorrect outputs and accuracy degradation. However, the impact of soft errors on the compute engine and the respective mitigation techniques have not been thoroughly studied yet for SNNs. A potential solution is employing redundant execution (re-execution) to ensure correct outputs, but it incurs huge latency and energy overheads. Toward this, we propose SoftSNN, a novel methodology to mitigate soft errors in the weight registers (synapses) and neurons of SNN accelerators without re-execution, thereby maintaining accuracy with low latency and energy overheads. Our SoftSNN methodology employs the following key steps: (1) analyzing the SNN characteristics under soft errors to identify faulty weights and neuron operations, which is required for recognizing faulty SNN behavior; (2) a Bound-and-Protect technique that leverages this analysis to improve SNN fault tolerance by bounding the weight values and protecting the neurons from faulty operations; and (3) devising lightweight hardware enhancements for the neural hardware accelerator to efficiently support the proposed technique. The experimental results show that, for a 900-neuron network even under a high fault rate, our SoftSNN maintains the accuracy degradation below 3%, while reducing latency and energy by up to 3x and 2.3x, respectively, as compared to the re-execution technique.
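The bounding half of the Bound-and-Protect idea can be pictured as clamping each weight to a range derived from the fault-free network, so that a bit flip in a high-order bit cannot blow the value up. The threshold choice below is an assumption made for illustration, not SoftSNN's exact rule.

```python
import numpy as np

def bound_weights(weights, reference_weights, margin=1.1):
    """Clamp possibly fault-corrupted weights to a bound derived from the
    fault-free reference values (illustrative sketch, not SoftSNN's rule)."""
    bound = margin * np.max(np.abs(reference_weights))
    # Any bit flip that pushed a weight far outside the trained range is
    # pulled back to the bound, limiting its effect on neuron outputs.
    return np.clip(weights, -bound, bound)
```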
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (2 more...)
GraphCore Goes Full 3D With AI Chips
The 3D stacking of chips has been the subject of much speculation and innovation in the past decade, and we will be the first to admit that we have been mostly thinking about this as a way to cram more capacity into a given compute engine while at the same time getting components closer together along the Z axis, rather than just working in 2D down on the X and Y axes. It was extremely interesting to see, then, that the 3D wafer-on-wafer stacking that AI chip and system upstart GraphCore has been working on with Taiwan Semiconductor Manufacturing Co had nothing to do with making logic circuits more dense within a socket. This will happen over time, of course, but the 3D wafer stacking that GraphCore and TSMC have been exploring together and are delivering in the third-generation "Bow" GraphCore IPU – the systems based on them bear the same nickname – is about creating a power delivery die that is bonded to the bottom of the existing compute die. The effect of this innovation is that GraphCore can get a more even power supply to the IPU, and therefore it can drop the voltage on its circuits and increase the clock frequency while at the same time burning less power. The grief and cost of building this power supply wafer and stacking the IPU wafer on top are outweighed by the performance and thermal benefits to the IPU, so GraphCore and its customers come out ahead on the innovation curve.
- Asia > Taiwan (0.25)
- Europe > United Kingdom > England > Buckinghamshire > Milton Keynes (0.05)
- Semiconductors & Electronics (1.00)
- Information Technology > Hardware (0.36)
Google Teaches AI To Play The Game Of Chip Design
As if it were not bad enough that Moore's Law improvements in the density and cost of transistors are slowing, the cost of designing chips and of the factories that are used to etch them is also on the rise. Any savings on any of these fronts will be most welcome to keep IT innovation leaping ahead. One of the promising frontiers of research right now in chip design is using machine learning techniques to actually help with some of the tasks in the design process. We will be discussing this at our upcoming The Next AI Platform event in San Jose on March 10 with Elias Fallon, engineering director at Cadence Design Systems.
Deploying a ML Model on Google Compute Engine - WebSystemer.no
Flask is not a web server. It is a micro web application framework, a set of tools and libraries that make it easier and prettier to build web applications. Flask ships with Werkzeug, a WSGI utility library that provides a simple web server for development purposes. While Flask's development server is good enough to test the main functionality of the app, it should not be used in production: it does not scale well and by default serves only one request at a time.
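For a production deployment on a VM such as a Compute Engine instance, one common minimal setup is to keep Flask for the application code and put a WSGI server such as Gunicorn in front of it. The app.py module, the /predict route, and the Gunicorn flags below are illustrative assumptions, not details from the article.

```python
# app.py -- minimal Flask app exposing a prediction endpoint (illustrative example)
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    # A real deployment would call model.predict(...) here; we echo the
    # input back purely for illustration.
    return jsonify({"input": payload, "prediction": None})

if __name__ == "__main__":
    # Development only. In production, run the app under a WSGI server, e.g.:
    #   gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
    app.run(host="0.0.0.0", port=8000)
```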
ORIGAMI: A Heterogeneous Split Architecture for In-Memory Acceleration of Learning
Falahati, Hajar, Lotfi-Kamran, Pejman, Sadrosadati, Mohammad, Sarbazi-Azad, Hamid
The memory bandwidth bottleneck is a major challenge in processing machine learning (ML) algorithms. In-memory acceleration has the potential to address this problem; however, it needs to address two challenges. First, an in-memory accelerator should be general enough to support a large set of different ML algorithms. Second, it should be efficient enough to utilize bandwidth while meeting the limited power and area budgets of the logic layer of a 3D-stacked memory. We observe that previous work fails to simultaneously address both challenges. We propose ORIGAMI, a heterogeneous set of in-memory accelerators that supports the compute demands of different ML algorithms and also uses an off-the-shelf compute platform (e.g., FPGA, GPU, TPU) to utilize bandwidth without violating strict area and power budgets. ORIGAMI offers a pattern-matching technique to identify similar computation patterns across ML algorithms and extracts a compute engine for each pattern. These compute engines constitute heterogeneous accelerators integrated on the logic layer of a 3D-stacked memory. A combination of these compute engines can execute any type of ML algorithm. To utilize the available bandwidth without violating the area and power budgets of the logic layer, ORIGAMI comes with a computation-splitting compiler that divides an ML algorithm between the in-memory accelerators and an out-of-the-memory platform in a balanced way and with minimum inter-communication. The combination of pattern matching and split execution offers a new design point for the acceleration of ML algorithms. Evaluation results across 12 popular ML algorithms show that ORIGAMI outperforms the state-of-the-art accelerator with 3D-stacked memory in terms of performance and energy-delay product (EDP) by 1.5x and 29x (up to 1.6x and 31x), respectively. Furthermore, the results are within a 1% margin of an ideal system that has unlimited compute resources on the logic layer of a 3D-stacked memory.
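One way to picture the computation-splitting step is as partitioning an ML algorithm's operator list so that the in-memory engines and the external platform each receive a share proportional to their compute budget while keeping the cut between the two sides small. The greedy prefix split below is an assumed simplification for illustration, not ORIGAMI's actual compiler.

```python
def split_operators(ops, in_memory_budget):
    """Greedy split of an operator list between in-memory accelerators and an
    off-memory platform (illustrative sketch, not ORIGAMI's compiler).

    ops: list of (name, cost) tuples in topological order.
    in_memory_budget: fraction of total cost assigned to the in-memory side.
    """
    total = sum(cost for _, cost in ops)
    in_memory, off_memory, used = [], [], 0.0
    for name, cost in ops:
        # Keeping a contiguous prefix in memory limits the number of
        # tensors that must cross between the two sides.
        if used + cost <= in_memory_budget * total:
            in_memory.append(name)
            used += cost
        else:
            off_memory.append(name)
    return in_memory, off_memory
```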
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
- Asia > Middle East > Iran (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
- Health & Medicine > Therapeutic Area > Oncology (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
More Details Emerge About Arm's Machine Learning
Arm is definitely targeting deep-neural-network (DNN) machine-learning (ML) applications with its proposed hardware designs, but its initial ML hardware descriptions were a bit vague (Figure 1). Though the final details aren't ready yet, Arm has exposed more of the architecture. The Arm ML processor is supported by the company's Neural Network (NN) software development kit, which bridges the interface between ML software and the underlying hardware. This allows developers to target Arm's CPU, GPU, and ML processors. In theory, waiting for the ML hardware will allow a critical mass of software to be available when the real hardware finally arrives.